Run-Length FM-index

Authors

  • Veli Mäkinen
  • Gonzalo Navarro
Abstract

The FM-index is a succinct text index needing only O(H_k n) bits of space, where n is the text size and H_k is the kth-order entropy of the text. The FM-index assumes a constant alphabet; it uses space exponential in the alphabet size, σ. In this paper we show how the same ideas can be used to obtain an index needing O(H_k n) bits of space, with the constant factor depending only logarithmically on σ. Our space complexity becomes better as soon as σ log σ > log n, which in practice means for all but very small alphabets, even with huge texts. We retain the same search complexity as the FM-index.

FM-index

The FM-index [3] is based on the Burrows-Wheeler transform (BWT) [1], which produces a permutation of the original text, denoted by T^bwt = bwt(T). String T^bwt is the result of the following forward transformation: (1) append to the end of T a special end marker $, which is lexicographically smaller than any other character; (2) form a conceptual matrix M whose rows are the cyclic shifts of the string T$, sorted in lexicographic order; (3) construct the transformed text L by taking the last column of M. The first column is denoted by F.

The suffix array A of text T$ is essentially the matrix M: A[i] = j iff the ith row of M contains the string t_j t_{j+1} ... t_n $ t_1 ... t_{j-1}. Given the suffix array, the search for the occurrences of the pattern P = p_1 p_2 ... p_m is trivial. The occurrences form an interval [sp, ep] in A such that the suffixes t_{A[i]} t_{A[i]+1} ... t_n, sp ≤ i ≤ ep, contain the pattern as a prefix. This interval can be found using two binary searches in time O(m log n) [5]. The suffix array of text T is represented implicitly by T^bwt.
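The forward transformation and the suffix-array search described above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names bwt_matrix and sa_interval are hypothetical, indices are 0-based rather than 1-based, the matrix M is built naively by sorting all cyclic shifts (quadratic space, for illustration only), and every character of the text is assumed to sort after '$'.

```python
def bwt_matrix(text):
    """Return (F, L, A) for text + '$' by naively sorting all cyclic shifts.

    F and L are the first and last columns of the conceptual matrix M,
    and A is the suffix array (0-based start positions).
    Assumes every character of `text` compares greater than '$'.
    """
    s = text + "$"
    n = len(s)
    # Row i of M is the cyclic shift of s starting at position order[i].
    order = sorted(range(n), key=lambda j: s[j:] + s[:j])
    first = "".join(s[j] for j in order)
    # The last character of the shift starting at j is s[j - 1] (wraps for j = 0).
    last = "".join(s[j - 1] for j in order)
    return first, last, order


def sa_interval(text, sa, pattern):
    """Return the 0-based interval (sp, ep) of suffixes prefixed by `pattern`.

    Uses two binary searches over the suffix array; the interval is
    empty when ep < sp.
    """
    s = text + "$"
    m = len(pattern)

    # First binary search: leftmost suffix whose m-character prefix >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo

    # Second binary search: leftmost suffix whose m-character prefix > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo - 1


F, L, A = bwt_matrix("mississippi")
sp, ep = sa_interval("mississippi", A, "ssi")
print(L, (sp, ep), ep - sp + 1)
```

Each binary-search step compares at most m characters, which is where the O(m log n) bound comes from.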
The novel idea of the FM-index is to store T^bwt in compressed form, and to simulate a backward search in the suffix array as follows:

Algorithm FM Search(P[1, m], T[1, n])
    c = P[m]; i = m;
    sp = C_T[c] + 1; ep = C_T[c + 1];
    while (sp ≤ ep) and (i ≥ 2) do
        c = P[i − 1];
        sp = C_T[c] + Occ(T^bwt, c, sp − 1) + 1;
        ep = C_T[c] + Occ(T^bwt, c, ep);
        i = i − 1;
    if (ep < sp) then return “not found” else return “found (ep − sp + 1) occs”.
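A minimal Python sketch of this backward search follows. The excerpt does not define C_T and Occ, so the sketch assumes the usual meanings: C_T[c] is the number of characters smaller than c (counted here over T$, so that rows are numbered 1 to n + 1), and Occ(T^bwt, c, i) is the number of occurrences of c in the first i characters of T^bwt. The helper names build_fm, occ and fm_search are hypothetical, T^bwt is stored uncompressed, and Occ is computed by a linear scan instead of the compressed structures the paper is about.

```python
def build_fm(text):
    """Build (bwt, C, counts) for text + '$' in a naive, uncompressed way.

    Assumes every character of `text` compares greater than '$'.
    """
    s = text + "$"
    order = sorted(range(len(s)), key=lambda j: s[j:] + s[:j])
    bwt = "".join(s[j - 1] for j in order)

    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1

    # C[c] = number of characters in s strictly smaller than c.
    C = {}
    total = 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return bwt, C, counts


def occ(bwt, c, i):
    """Number of occurrences of c in the first i characters of bwt."""
    return bwt[:i].count(c)  # linear scan; a real index answers this quickly


def fm_search(pattern, bwt, C, counts):
    """Count occurrences of a non-empty `pattern` by backward search.

    sp and ep are 1-based row numbers, as in the pseudocode.
    """
    c = pattern[-1]
    if c not in C:
        return 0
    sp = C[c] + 1
    ep = C[c] + counts[c]  # plays the role of C_T[c + 1]
    i = len(pattern)
    while sp <= ep and i >= 2:
        c = pattern[i - 2]  # P[i - 1] in the 1-based pseudocode
        if c not in C:
            return 0
        sp = C[c] + occ(bwt, c, sp - 1) + 1
        ep = C[c] + occ(bwt, c, ep)
        i -= 1
    return max(0, ep - sp + 1)


bwt, C, counts = build_fm("mississippi")
print(fm_search("ssi", bwt, C, counts))
```

The loop runs at most m − 1 times and never touches the suffix array itself, so the search cost is governed entirely by how fast Occ can be answered on the stored T^bwt.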

Related resources

Succinct Suffix Arrays Based on Run-Length Encoding

A succinct full-text self-index is a data structure built on a text T = t_1 t_2 ... t_n, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p_1 p_2 ... p_m in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. The...

Compressed and Searchable Indexes for Highly Similar Strings (Invited Talk)

The collection indexing problem is defined as follows: Given a collection of highly similar strings, build a compressed index for the collection of strings, and when a pattern is given, find all occurrences of the pattern in the given strings. Since the index is compressed, we also need a separate operation which retrieves a specified substring of one of the given strings. Such a collection of ...

Fast Locating with the RLBWT

Indexing highly repetitive texts — such as genomic databases, software repositories and versioned text collections — has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) ...

A simple alphabet-independent FM-index

We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zero-order entropy H_0, our index needs O(n(H_0 + 1)) bits of space, wi...

Fast construction of FM-index for long sequence reads

SUMMARY We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one th...

Publication date: 2004